Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new AtU8 beam chunk #1078

Merged
merged 1 commit into from
Feb 1, 2017
Merged

Add new AtU8 beam chunk #1078

merged 1 commit into from
Feb 1, 2017

Conversation

josevalim
Copy link
Contributor

@josevalim josevalim commented May 31, 2016

The new chunk stores atoms encoded in UTF-8.

This is still work in progress and the following tasks are still pending:

Feedback request:

  1. beam_lib has been modified to support a new 'utf8_atoms' chunk that maps to the new "AtU8" chunk. This means accessing the old "Atom" chunk can now be missing although it is straight-forward to handle it with a case statement.
  2. list_to_atom has not been modified and it will still fail if given a character more than 255. The reason I have decided to not change list_to_atom is because its current documentation does not say it may support characters more than 255 in the future (while the binary_to_atom has a very explicit warning about such).

@ferd
Copy link
Contributor

ferd commented May 31, 2016

What is meant by an UTF8 codepoint? I thought that by the time you're in UTF-8 you no longer deal in codepoints, just in bytes?

The Unicode consortium does specify that lengths can be counted in one of 4 ways. From the FAQ using the string aनि亜𐂃 (U+0061, U+0928, U+093F, U+4E9C, U+10083)

  • bytes: UTF-8: 14, UTF-16: 12, UTF-32: 20
  • code units (position in a string with the given size): UTF-8: 14, UTF-16: 6, UTF-32: 5
  • code points (how many unicode code points are in there): 5 for all, since there is a combining mark
  • grapheme clusters: 4 for all of them, since this is a logical grouping for the length based on 'characters' as we people usually think of them, rather than the implementation of them.

The two latter are not encoding-specific. What is meant exactly by UTF-8 code points? Just code points, or it's a bad term for code units (and hence bytes)?

@psyeugenic psyeugenic added team:VM Assigned to OTP team VM team:MW feature labels May 31, 2016
@josevalim
Copy link
Contributor Author

josevalim commented May 31, 2016

Hey @ferd, I meant literally code points as in your definition:

code points (how many unicode code points are in there): 5 for all, since there is a combining mark

It is also worth mentioning the implementation today (prior to this patch) also checks code points. This can be checked in the source code or by trying a quick example:

3> binary_to_atom(binary:copy(<<"é"/utf8>>, 255), utf8).
ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé
4> byte_size(binary:copy(<<"é"/utf8>>, 255)).
510
5> binary_to_atom(binary:copy(<<"é"/utf8>>, 256), utf8).
** exception error: a system limit has been reached
     in function  binary_to_atom/2
        called as binary_to_atom(<<"éééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé"/utf8...>>,
                                 utf8)

So this pull request is currently backwards incompatible (hence the pending tasks). The question is mostly how to implement the new binary_to_atom efficiently.

@@ -6332,7 +6355,7 @@ erts_make_stub_module(Process* p, Eterm Mod, Eterm Beam, Eterm Info)
goto error;
}
define_file(stp, "atom table", ATOM_CHUNK);
if (!load_atom_table(stp)) {
if (!load_atom_table(stp, ERTS_ATOM_ENC_LATIN1)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to go through the same if (stp->chunks[UTF8_ATOM_CHUNK].size > 0) ... charade down here, or code:make_stub_module/3 (and thus HiPE) will break. There's some tests for it in erts/emulator/test/code_SUITE*

Copy link
Contributor Author

@josevalim josevalim Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I have run the code suite and fixed this and the remaining failing tests, I have updated this PR.

@OTP-Maintainer
Copy link

Patch has passed first testings and has been assigned to be reviewed


I am a script, I am not human


@bjorng bjorng self-assigned this Jun 3, 2016
@bjorng
Copy link
Contributor

bjorng commented Jun 3, 2016

We will not have time to do a thorough review until OTP 19.0 has been released, but here is some quick feedback:

The test case lc_SUITE:effect/1 fails.

binary_to_atom(): Solution i, that is keep the existing error semantics and implement it efficiently.

Not modifying list_to_atom/1 is an interesting idea that we have not considered.

@nox
Copy link
Contributor

nox commented Jun 3, 2016

@bjorng Can we name the chunk 'JOSÉ'? PLS.

@bjorng bjorng added the waiting waiting for changes/input from author label Sep 22, 2016
@bjorng
Copy link
Contributor

bjorng commented Sep 22, 2016

We will take a closer at this PR request when you have fixed the failed test case mentioned above and rebased it on the latest master.

@nox
Copy link
Contributor

nox commented Sep 22, 2016

It is also worth mentioning the implementation today (prior to this patch) also checks code points. This can be checked in the source code or by trying a quick example:

I think you interpreted that incorrectly. Currently only UTF-8 encoded atoms that can be converted to Latin-1 are supported, so it just counts the number of Latin-1 code points, which equates the number of Latin-1 code units, which equates the number of bytes in the result.

@josevalim
Copy link
Contributor Author

@bjorng I will allocate time in the next weeks to work on all of my pending pull requests. :D

While talking to @nox, we came up with another solution for the second bullet above. @nox commented that it would be much simpler to count bytes instead of counting codepoints. However, if we count bytes and keep the limit of 255 bytes, it would be backwards incompatible because an atom made only of the letter é in latin1 will now take double the space in UTF-8, possibly triggering the system limit.

However, if we double the maximum atom size, we could keep the conceptually simpler and faster code that only counts bytes and remain backwards compatible.

From the Elixir perspective, Increasing the atom limit would be a positive change, since function names must be atoms, sometimes we may reach this limit when building functions dynamically.

@bjorng thoughts on this suggestion?

@bjorng
Copy link
Contributor

bjorng commented Sep 24, 2016

I think that the BIFs that create atoms should count code points, and let the limit be 255 code points. When converting from a list, the number of code points is the number of elements in the list (assuming that we don't try to combine code points that can be combined).

The chunk in the BEAM file can give the size of each atom in bytes, and the loader can trust the compiler not to create overlong atoms. (There could be a sanity check to abort the loading if any atom is longer than 1024 bytes.)

@bjorng
Copy link
Contributor

bjorng commented Sep 24, 2016

Rationale for counting code points: We don't want to handle support request from Klingon users that complain that they can only put 85 Klingon characters into their atoms, while speakers of most west-European languages can have 255 characters in their atoms.

@nox
Copy link
Contributor

nox commented Sep 24, 2016

What about following what ROK proposed in EEP20?

2 bytes for byte length, 2 bytes for code point count.

Eterm
erts_atom_put(const byte *name, int len, ErtsAtomEncoding enc, int trunc)
int
erts_atom_put_index(const byte *name, int len, ErtsAtomEncoding enc, int trunc)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjorng I have introduced this function that returns the atom index instead of the atom so we have control upstream if we should error with badarg or system_limit. I have two questions I would love your input on:

  1. Should the name have the erts_ prefix?
  2. Should we use #define ATOM_BAD_ENCODING and #define ATOM_MAX_CHARS instead of using integers? If so, is it fine to define those on atom.h?

Notice the new function and its return values are used in binary_to_atom (erl_unicode.c). The PR has also been rebased. Once those questions are answered, I will squash everything and we should be good to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention for the erts_ prefix is to use it for global functions. Older function, or some very often used functions, e.g. size_object does not use it though.

Copy link
Contributor

@bjorng bjorng Sep 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answer for 2: Yes, and yes.

@josevalim
Copy link
Contributor Author

josevalim commented Sep 26, 2016

@bjorng This is ready for review. I have rebased, fixed all feedback and features.

I have ran emulator, compiler and stdlib test suites with the following results:

  • emulator: One failure regarding registered sends across nodes (I inspected the code and it seems to not related be to this patch)
  • compiler: No failures
  • stdlib: One failure in the shell suite which seems unrelated

I have also added a test that compiles a module from the Erlang Abstract Format with utf8 atoms, to ensure the whole compilation chain works, and also a test for the r19 option.

@OTP-Maintainer
Copy link

Patch has passed first testings and has been assigned to be reviewed


I am a script, I am not human


@OTP-Maintainer
Copy link

Patch has passed first testings and has been assigned to be reviewed


I am a script, I am not human


@bjorng bjorng removed the waiting waiting for changes/input from author label Oct 5, 2016
@bjorng
Copy link
Contributor

bjorng commented Oct 12, 2016

Three test cases fail in compile_SUITE: core, core_roundtrip, and asm.

I updated the primary bootstrap and compiled from a clean repository.

@bjorng bjorng added the waiting waiting for changes/input from author label Oct 12, 2016
@josevalim
Copy link
Contributor Author

I updated the primary bootstrap and compiled from a clean repository.

How do I update the primary bootstrap? I remember doing a clean build but I did not touch the bootstrap files. Thanks!

@bjorng bjorng added testing currently being tested, tag is used by OTP internal CI and removed waiting waiting for changes/input from author labels Dec 7, 2016
@bjorng bjorng removed the testing currently being tested, tag is used by OTP internal CI label Dec 14, 2016
@bjorng
Copy link
Contributor

bjorng commented Dec 19, 2016

I found and fixed another minor bug. Please squash and rebase on latest master.

@bjorng bjorng added the waiting waiting for changes/input from author label Dec 19, 2016
@josevalim
Copy link
Contributor Author

Done and done!

Btw the reducing memory during compilation I had to rebase on may help Elixir too. 👍

@bjorng bjorng removed the waiting waiting for changes/input from author label Dec 19, 2016
@bjorng bjorng added testing currently being tested, tag is used by OTP internal CI and removed testing currently being tested, tag is used by OTP internal CI labels Jan 9, 2017
@bjorng
Copy link
Contributor

bjorng commented Jan 13, 2017

Can you please rebase again?

We will test the branch in our daily builds again to see that there are no remaining failed test cases.

@josevalim
Copy link
Contributor Author

@bjorng done!

@bjorng bjorng added the testing currently being tested, tag is used by OTP internal CI label Jan 14, 2017
@bjorng
Copy link
Contributor

bjorng commented Jan 25, 2017

This branch no longer seems to cause any problems in our daily builds. We are thinking about merging it soon. More work is probably needed in some applications, for example to ensure that "~ts" is used as format specifier when an atom is displayed, but we can do that in separate branches.

However, I think that the documentation should be updated in this branch to reflect the changes, at least the BIFs for the atom BIFs. Do you think you could update the documentation?

@josevalim
Copy link
Contributor Author

I will push the docs soon. :) Two questions:

  1. Regarding ~ts, I understand why it is needed for binaries and char lists, but it shouldn't be necessary for atoms assuming that atoms do have the encoding they are written in? So I suggest making ~s work regardless of the atom encoding. Or do you see reasons where this would be a bad idea™?

  2. Should we explicitly document module names can now be UTF-8? If so, I would like to add a test to compile_SUITE where we compile Erlang Forms module with a UTF-8 atom as name.

@josevalim
Copy link
Contributor Author

I have pushed a separate commit with docs. Once review is done, I can squash it all together.

@bjorng
Copy link
Contributor

bjorng commented Jan 26, 2017

  1. Making ~s work for any atom is an interesting idea. I will discuss it with the team.

  2. We have not reached a final decision yet. Our current thinking is that we will allow Unicode characters in module names, but we will not recommend it because of potential portability problems having to do with the file names (Linux does not enforce any particular encoding for file names, and their could be other potential pitfalls when moving files from one operating system to another). The reason that we will probably allow it is because those potential problem already exists for module names with Latin1 characters.

@josevalim
Copy link
Contributor Author

Our current thinking is that we will allow Unicode characters in module names, but we will not recommend it because of potential portability problems having to do with the file names

I agree. Let me know what are your decisions regarding 1 and 2 and I will gladly add tests and improve the docs.

@bjorng
Copy link
Contributor

bjorng commented Jan 30, 2017

There may some time before we can reach a decision. We will merge this branch and do further corrections/improvement in separate pull requests.

Please squash the commits. When you have done it, we will run the branch once more in our daily builds and then merge it.

The new chunk stores atoms encoded in UTF-8.

beam_lib has also been modified to handle the new
'utf8_atoms' attribute while the 'atoms' attribute
may be a missing chunk from now on.

The binary_to_atom/2 BIF can now encode any utf8
binary with up to 255 characters.

The list_to_atom/1 BIF can now accept codepoints
higher than 255 with up to 255 characters (thanks
to Björn Gustavsson).
@josevalim
Copy link
Contributor Author

Squashed and pushed.

@nox
Copy link
Contributor

nox commented Jan 30, 2017

do further corrections/improvement in separate pull requests.

Like renaming the chunk to 'JOSÉ', right?

@bjorng bjorng merged commit 26b59df into erlang:master Feb 1, 2017
@josevalim
Copy link
Contributor Author

🎉

@KronicDeth
Copy link
Contributor

Support for decompiling AtU8 chunks has been added to IntelliJ Elixir in KronicDeth/intellij-elixir#777, which will be released in version 6.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature team:VM Assigned to OTP team VM testing currently being tested, tag is used by OTP internal CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants